Abstract


The use of conventional maximum likelihood 
estimates hinders the performance of existing 
phrase-based translation models. For lack of 
sufficient training data, most models only con- 
sider a small amount of context. As a par- 
tial remedy, we explore here several contin- 
uous space translation models, where transla- 
tion probabilities are estimated using a con- 
tinuous representation of translation units in 
lieu of standard discrete representations. In 
order to handle a large set of translation units, 
these representations and the associated esti- 
mates are jointly computed using a multi-layer 
neural network with a SOUL architecture. In 
small scale and large scale English to French 
experiments, we show that the resulting mod- 
els can effectively be trained and used on top 
of an n-gram translation system, delivering significant
improvements in performance.


1 Introduction 


The phrase-based approach to statistical machine 
translation (SMT) is based on the following infer- 
ence rule, which, given a source sentence s, selects 
the target sentence t and the underlying alignment a 
maximizing the following term: 


$$P(t, a \mid s) = \frac{1}{Z} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(s, t, a) \right), \qquad (1)$$


where $K$ feature functions ($f_k$) are weighted by a
set of coefficients ($\lambda_k$), and $Z$ is a normalizing factor.
The phrase-based approach differs from other
approaches by the hidden variables of the translation
process: the segmentation of a parallel sentence pair
into phrase pairs and the associated phrase alignments.
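
As an illustration of equation (1), the following Python sketch scores a few candidate hypotheses with a log-linear model. The feature values, the weights and the restriction of the normalizer Z to an explicit candidate list are assumptions made for this example only; an actual decoder ranks hypotheses over its whole search space.

```python
import math

def log_linear_posteriors(feature_vectors, weights):
    """Return P(t, a | s) for each candidate, following equation (1).

    feature_vectors: one list [f_1, ..., f_K] per candidate (t, a).
    weights: the K coefficients lambda_k.
    Assumption: Z is computed over the given candidates only.
    """
    scores = [sum(l * f for l, f in zip(weights, fv)) for fv in feature_vectors]
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical example: three candidates, two features (say, log TM and log LM scores).
candidates = [[-2.0, -1.0], [-1.5, -2.5], [-3.0, -0.5]]
lambdas = [1.0, 0.5]
posteriors = log_linear_posteriors(candidates, lambdas)
best = max(range(len(candidates)), key=lambda i: posteriors[i])
```

Since Z is shared by all candidates of a given source sentence, selecting the best hypothesis only requires comparing the weighted feature sums.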

This formulation was introduced in (Zens et al., 
2002) as an extension of the word based mod- 
els (Brown et al., 1993), then later motivated within 
a discriminative framework (Och and Ney, 2004). 
One motivation for integrating more feature func- 
tions was to improve the estimation of the translation 
model P(t|s), which was initially based on relative 
frequencies, thus yielding poor estimates. 

This is because the units of phrase-based mod- 
els are phrase pairs, made of a source and a tar- 
get phrase; such units are viewed as the events of 
discrete random variables. The resulting representa- 
tions of phrases (or words) thus entirely ignore the 
morphological, syntactic and semantic relationships 
that exist among those units in both languages. This 
lack of structure hinders the generalization power of 
the model and reduces its ability to adapt to other 
domains. Another consequence is that phrase-based 
models usually consider a very restricted context¹.

¹ Typically a small number of preceding phrase pairs for the
n-gram based approach (Crego and Mariño, 2006), or no context
at all, for the standard approach of (Koehn et al., 2007).

This is a general issue in statistical Natural Language
Processing (NLP) and many possible remedies have been proposed
in the literature, such as, for instance, using smoothing
techniques (Chen and Goodman, 1996), or working with
linguistically enriched, or more abstract, representations.
In statistical language modeling, another line of research
considers numerical representations, trained automatically
through the use of neural networks (see e.g.
(Collobert et al., 2011)). An influential proposal,
in this respect, is the work of (Bengio et al., 2003) 
on continuous space language models. In this ap- 
proach, n-gram probabilities are estimated using a 
continuous representation of words in lieu of stan- 
dard discrete representations. Experimental results, 
reported for instance in (Schwenk, 2007) show sig- 
nificant improvements in speech recognition appli- 
cations. Recently, this model has been extended in 
several promising ways (Mikolov et al., 2011; Kuo 
et al., 2010; Liu et al., 2011). In the context of SMT, 
Schwenk et al. (2007) is the first attempt to esti- 
mate translation probabilities in a continuous space. 
However, because of the proposed neural architec- 
ture, the authors only consider a very restricted set 
of translation units, and therefore report only a slight 
impact on translation performance. The recent pro- 
posal of (Le et al., 2011a) seems especially relevant, 
as it is able, through the use of class-based models, 
to handle arbitrarily large vocabularies and opens the 
way to enhanced neural translation models. 

In this paper, we explore various neural architec- 
tures for translation models and consider three dif- 
ferent ways to factor the joint probability P(s,t) 
differing by the units (respectively phrase pairs, 
phrases or words) that are projected in continuous 
spaces. While these decompositions are theoreti- 
cally straightforward, they were not considered in 
the past because of data sparsity issues and of the 
resulting weaknesses of conventional maximum like- 
lihood estimates. Our main contribution is then to 
show that such joint distributions can be efficiently 
computed by neural networks, even for very large 
context sizes; and that their use yields significant 
performance improvements. These models are evaluated
in an n-best rescoring step using the framework
of n-gram based systems, within which they integrate
easily. Note, however, that they could be used
with any phrase-based system.

The rest of this paper is organized as follows. We
first recall, in Section 2, the n-gram based approach,
and discuss various implementations of this
framework. We then describe, in Section 3, the neu- 
ral architecture developed and explain how it can be 
made to handle large vocabulary tasks as well as lan- 
guage models over bilingual units. We finally re- 
port, in Section 4, experimental results obtained on 
a large-scale English to French translation task.


2 Variations on the n-gram approach


Even though n-gram translation models can be 
integrated within standard phrase-based systems 
(Niehues et al., 2011), the n-gram based frame- 
work provides a more convenient way to introduce 
our work and has also been used to build the base- 
line systems used in our experiments. In the n- 
gram based approach (Casacuberta and Vidal, 2004; 
Mariño et al., 2006; Crego and Mariño, 2006), trans- 
lation is divided in two steps: a source reordering 
step and a translation step. Source reordering is 
based on a set of learned rewrite rules that non- 
deterministically reorder the input words so as to 
match the target order thereby generating a lattice 
of possible reorderings. Translation then amounts 
to finding the most likely path in this lattice using an
n-gram translation model² of bilingual units.


2.1 The standard n-gram translation model 


n-gram translation models (TMs) rely on a specific
decomposition of the joint probability $P(s, t)$,
where $s$ is a sequence of $I$ reordered source words
($s_1, \ldots, s_I$) and $t$ contains $J$ target words ($t_1, \ldots, t_J$).
This sentence pair is further assumed to be decomposed
into a sequence of $L$ bilingual units called tuples
defining a joint segmentation: $(s, t) = u_1, \ldots, u_L$.
In the approach of (Mariño et al., 2006), this segmentation
is a by-product of source reordering, and ultimately derives
from initial word and phrase alignments. In this framework,
the basic translation units are tuples, which are the analogue
of phrase pairs, and represent a matching $u = (\bar{s}, \bar{t})$
between a source phrase $\bar{s}$ and a target phrase $\bar{t}$
(see Figure 1). Using the n-gram assumption, the joint probability
of a segmented sentence pair decomposes as:


$$P(s, t) = \prod_{i=1}^{L} P(u_i \mid u_{i-1}, \ldots, u_{i-n+1}) \qquad (2)$$


A first issue with this model is that the elementary 
units are bilingual pairs, which means that the under- 
lying vocabulary, hence the number of parameters, 
can be quite large, even for small translation tasks. 
Due to data sparsity issues, such models are bound
to face severe estimation problems. Another problem
with (2) is that the source and target sides play
symmetric roles, whereas the source side is known,
and the target side must be predicted.

² As in the standard phrase-based approach, the translation
process also involves additional feature functions that are
presented below.
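
To make the estimation problem concrete, here is a minimal Python sketch of equation (2) with plain relative-frequency (maximum likelihood) estimates over tuple sequences. The toy corpus, the `<s>` padding symbol and the function names are assumptions introduced for illustration; no smoothing is applied, so any unseen tuple n-gram receives a zero probability.

```python
from collections import Counter

PAD = ("<s>", "<s>")  # hypothetical sentence-start tuple

def train_tuple_ngram(corpus, n):
    """Count tuple n-grams and their (n-1)-tuple contexts."""
    ngram_counts, context_counts = Counter(), Counter()
    for tuples in corpus:
        padded = [PAD] * (n - 1) + list(tuples)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def joint_probability(tuples, n, ngram_counts, context_counts):
    """P(s, t) as the product of equation (2); 0.0 for unseen events."""
    padded = [PAD] * (n - 1) + list(tuples)
    prob = 1.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        if context_counts[context] == 0:
            return 0.0
        prob *= ngram_counts[context + (padded[i],)] / context_counts[context]
    return prob

# Toy corpus of tuple sequences: each tuple is (source phrase, target phrase).
corpus = [
    [("recevoir", "receive"), ("le", "the"),
     ("nobel de la paix", "nobel peace"), ("prix", "prize")],
    [("recevoir", "receive"), ("le", "the"), ("prix", "prize")],
]
counts = train_tuple_ngram(corpus, 2)
p_seen = joint_probability(corpus[0], 2, *counts)
p_unseen = joint_probability([("prix", "prize"), ("le", "the")], 2, *counts)
```

Even in this tiny example, a simple change in tuple order yields a zero estimate, which is the sparsity issue discussed above.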


2.2 A factored n-gram translation model 


To overcome some of these issues, the n-gram probability
in equation (2) can be factored by decomposing
tuples into two (source and target) parts:


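$$\begin{aligned}
P(u_i \mid u_{i-1}, \ldots, u_{i-n+1}) = {} & P(\bar{t}_i \mid \bar{s}_i, \bar{s}_{i-1}, \bar{t}_{i-1}, \ldots, \bar{s}_{i-n+1}, \bar{t}_{i-n+1}) \\
& \times P(\bar{s}_i \mid \bar{s}_{i-1}, \bar{t}_{i-1}, \ldots, \bar{s}_{i-n+1}, \bar{t}_{i-n+1})
\end{aligned} \qquad (3)$$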


Decomposition (3) involves two models: the first 
term represents a TM, the second term is best viewed 
as a reordering model. In this formulation, the TM 
only predicts the target phrase, given its source and 
target contexts. 

Another benefit of this formulation is that the el- 
ementary events now correspond either to source or 
to target phrases, but never to pairs of such phrases. 
The underlying vocabulary is thus obtained as the 
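union, rather than the cross product, of phrase inventories.
Finally note that the n-gram probability
$P(u_i \mid u_{i-1}, \ldots, u_{i-n+1})$ could also factor as:

$$\begin{aligned}
& P(\bar{s}_i \mid \bar{t}_i, \bar{s}_{i-1}, \bar{t}_{i-1}, \ldots, \bar{s}_{i-n+1}, \bar{t}_{i-n+1}) \\
& \times P(\bar{t}_i \mid \bar{s}_{i-1}, \bar{t}_{i-1}, \ldots, \bar{s}_{i-n+1}, \bar{t}_{i-n+1})
\end{aligned} \qquad (4)$$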
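
The vocabulary argument made for the factored model can be checked on a toy tuple inventory. The sketch below is only illustrative: the tuples and the function name are made up, and it simply compares the number of distinct tuples (the event space of the standard model) with the sizes of the source and target phrase inventories (the event spaces of the factored model).

```python
def vocabulary_sizes(tuples):
    """Compare the tuple vocabulary with the separate phrase inventories."""
    sources = {src for src, _ in tuples}
    targets = {trg for _, trg in tuples}
    return {
        "tuple_vocab": len(set(tuples)),      # units of the standard model
        "src_vocab": len(sources),            # source events, factored model
        "trg_vocab": len(targets),            # target events, factored model
        "cross_product_bound": len(sources) * len(targets),
    }

# Hypothetical tuple inventory.
toy_tuples = [
    ("maison", "house"), ("maison", "home"),
    ("chez", "home"), ("chez", "at"), ("chez", "house"),
]
sizes = vocabulary_sizes(toy_tuples)
```

The tuple vocabulary can grow up to the cross product of the two inventories, whereas each factor of the decomposition only ever predicts a unit from one of them.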


2.3 A word factored translation model 


A more radical way to address the data sparsity is- 
sues is to take (source and target) words as the basic 
units of the n-gram TM. This may seem to be a step 
backwards, since the transition from word (Brown et 
al., 1993) to phrase-based models (Zens et al., 2002) 
is considered one of the main recent improvements
in MT. One important motivation for considering 
phrases rather than words was to capture local con- 
text in translation and reordering. It should then be 
stressed that the decomposition of phrases in words 
is only re-introduced here as a way to mitigate the 
parameter estimation problems. Translation units
are still pairs of phrases, derived from a bilingual 
segmentation in tuples synchronizing the source and 
target n-gram streams, as defined by equation (3). 
In fact, the estimation policy described in section 3 
will actually allow us to design n-gram models with
longer contexts than is typically possible in the conventional
n-gram approach.

Let $s_i^k$ denote the $k$-th word of source tuple $\bar{s}_i$.
Considering again the example of Figure 1, $s_{11}^1$ is
the source word nobel, $s_{11}^4$ is the source word
paix, and similarly $t_{11}^2$ is the target word peace. We
finally denote $h^{n-1}(t_i^k)$ the sequence made of the
$n-1$ words preceding $t_i^k$ in the target sentence: in
Figure 1, $h^3(t_{11}^2)$ thus refers to the three word context
receive the nobel associated with the target word
peace. Using these notations, equation (3) is rewritten as:


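$$P(s, t) = \prod_{i=1}^{L} \left[ \prod_{k=1}^{|\bar{t}_i|} P\left(t_i^k \mid h^{n-1}(t_i^k), h^{n-1}(s_{i+1}^1)\right) \times \prod_{k=1}^{|\bar{s}_i|} P\left(s_i^k \mid h^{n-1}(t_i^1), h^{n-1}(s_i^k)\right) \right] \qquad (5)$$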


This decomposition relies on the n-gram assump- 
tion, this time at the word level. Therefore, this 
model estimates the joint probability of a sentence 
pair using two sliding windows of length n, one for 
each language; however, the moves of these windows
remain synchronized by the tuple segmentation.
Moreover, the context is not limited to the current
phrase, and continues to include words in adjacent
phrases. Using the example of Figure 1, the contribution
of the target phrase $\bar{t}_{11}$ = nobel, peace
to $P(s, t)$ using a 3-gram model is


P(nobel | [receive, the], [la, paix])
× P(peace | [the, nobel], [la, paix]).


Likewise, the contribution of the source phrase
$\bar{s}_{11}$ = nobel, de, la, paix is:


P(nobel | [receive, the], [recevoir, le])
× P(de | [receive, the], [le, nobel])
× P(la | [receive, the], [nobel, de])
× P(paix | [receive, the], [de, la]).


A benefit of this new formulation is that the involved
vocabularies only contain words, and are thus much
smaller. These models are thus less prone to data
sparsity issues. While the TM defined
by equation (5) derives from equation (3), a variation
can be equivalently derived from equation (4).
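
The word-level events of equation (5) can be enumerated mechanically from a tuple segmentation. The Python sketch below does so for a 3-gram model; the function names, the `<s>` padding symbol and the exact tuple boundaries of the toy sentence (in particular treating le/the as a tuple of its own) are assumptions made for illustration.

```python
def history(words, position, n):
    """The n-1 words preceding `position` in `words`, left-padded with <s>."""
    start = position - (n - 1)
    pad = ["<s>"] * max(0, -start)
    return pad + words[max(0, start):position]

def word_factored_events(tuples, n):
    """List (side, predicted word, target context, source context) per equation (5)."""
    src = [w for s_phrase, _ in tuples for w in s_phrase]
    trg = [w for _, t_phrase in tuples for w in t_phrase]
    events = []
    src_pos = trg_pos = 0
    for s_phrase, t_phrase in tuples:
        src_end = src_pos + len(s_phrase)  # position of the next source phrase
        for k, word in enumerate(t_phrase):
            # Target word: its own target history, plus the source words
            # preceding the first word of the next source phrase.
            events.append(("trg", word, history(trg, trg_pos + k, n),
                           history(src, src_end, n)))
        for k, word in enumerate(s_phrase):
            # Source word: target words preceding the current target phrase,
            # plus its own source history.
            events.append(("src", word, history(trg, trg_pos, n),
                           history(src, src_pos + k, n)))
        src_pos = src_end
        trg_pos += len(t_phrase)
    return events

toy = [
    (["recevoir"], ["receive"]),
    (["le"], ["the"]),
    (["nobel", "de", "la", "paix"], ["nobel", "peace"]),
    (["prix"], ["prize"]),
]
events = word_factored_events(toy, 3)
```

Under these assumptions, the events generated for the third tuple match the conditioning contexts listed in the worked example above, for instance peace given [the, nobel] and [la, paix].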


org: à recevoir prix nobel de la paix
s:   recevoir | $\bar{s}_{11}$: nobel de la paix | prix
t:   receive  | $\bar{t}_{11}$: nobel peace      | prize


Figure 1: Extract of a French-English sentence pair segmented in bilingual units. The original (org) French sentence
appears at the top of the figure, just above the reordered source s and target t. The pair (s, t) decomposes into a
sequence of L bilingual units (tuples) $u_1, \ldots, u_L$. Each tuple $u_i$ contains a source and a target phrase: $\bar{s}_i$ and $\bar{t}_i$.


3 The SOUL model 


In the previous section, we defined three different 
n-gram translation models, based respectively on 
equations (2), (3) and (5). As discussed above, a 
major issue with such models is to reliably estimate 
their parameters, the numbers of which grow expo- 
nentially with the order of the model. This problem 
is aggravated in natural language processing, due to 
well known data sparsity issues. In this work, we 
take advantage of the recent proposal of (Le et al., 
2011a): using a specific neural network architecture 
(the Structured OUtput Layer model), it becomes 
possible to handle large vocabulary language mod- 
eling tasks, a solution that we adapt here to MT. 


3.1 Language modeling in a continuous space 


Let $V$ be a finite vocabulary; n-gram language models
(LMs) define distributions over finite sequences
of tokens (typically words) $w_1^L$ in $V^+$ as follows:


$$P(w_1^L) = \prod_{i=1}^{L} P(w_i \mid w_{i-n+1}^{i-1}) \qquad (6)$$


Modeling the joint distribution of several discrete
random variables (such as words in a sentence) is
difficult, especially in NLP applications where $V$
typically contains tens of thousands of words.

In spite of the simplifying n-gram assump- 
tion, maximum likelihood estimation remains un- 
reliable and tends to underestimate the proba- 
bility of very rare n-grams. Smoothing tech- 
niques, such as Kneser-Ney and Witten-Bell back- 
off schemes (see (Chen and Goodman, 1996) for an 
empirical overview, and (Teh, 2006) for a Bayesian 
interpretation), perform back-off to lower order distributions,
thus providing an estimate for the probability
of these unseen events.

One of the most successful alternatives to date is to
use distributed word representations (Bengio et al.,
2003), where distributionally similar words are represented
as neighbors in a continuous space. This
turns n-gram distributions into smooth functions
of the word representations. These representations
and the associated estimates are jointly computed
in a multi-layer neural network architecture. Figure 2
provides a partial representation of this kind
of model and illustrates its principles. To
compute the probability $P(w_i \mid w_{i-n+1}^{i-1})$, the $n-1$
context words are projected in the same continuous
space using a shared matrix $R$; these continuous
word representations are then concatenated to build
a single vector that represents the context; after a
non-linear transformation, the probability distribution
is computed using a softmax layer.
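
The forward computation just described can be sketched in a few lines of plain Python. This is a toy illustration rather than the architecture used in the experiments: the dimensions, the random initialization, the tanh non-linearity and the absence of bias terms are all assumptions.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nnlm_forward(context_ids, R, W_h, W_o):
    # Look up the n-1 context words in the shared matrix R and concatenate.
    x = [value for i in context_ids for value in R[i]]
    # Non-linear hidden layer (tanh is an assumption; biases omitted).
    hidden = [math.tanh(a) for a in matvec(W_h, x)]
    # Softmax over the whole output vocabulary.
    return softmax(matvec(W_o, hidden))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
V, d, H, n = 10, 4, 8, 4          # toy vocabulary, embedding, hidden sizes; 4-gram
R = random_matrix(V, d, rng)
W_h = random_matrix(H, (n - 1) * d, rng)
W_o = random_matrix(V, H, rng)
probs = nnlm_forward([1, 5, 2], R, W_h, W_o)
```

The final matrix-vector product has one row per output word, which is why the cost of this layer grows with the size of the output vocabulary.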

The major difficulty with the neural network ap- 
proach remains the complexity of inference and 
training, which largely depends on the size of the 
output vocabulary (i.e. the number of words that 
have to be predicted). One practical solution is to re- 
strict the output vocabulary to a short-list composed 
of the most frequent words (Schwenk, 2007). How- 
ever, the usual size of the short-list is under 20k, 
which does not seem sufficient to faithfully repre- 
sent the translation models of section 2. 


3.2 Principles of SOUL 


To circumvent this problem, Structured Output 
Layer (SOUL) LMs are introduced in (Le et al., 
2011a). Following Mnih and Hinton (2008), the 
SOUL model combines the neural network approach 
with a class-based LM (Brown et al., 1992). Structuring
the output layer and using word class information
makes the estimation of distributions over the
entire vocabulary computationally feasible.


To meet this goal, the output vocabulary is structured
as a clustering tree, where each word belongs
to only one class and its associated sub-classes. If
$w_i$ denotes the $i$-th word in a sentence, the sequence
$c_{1:D}(w_i) = c_1, \ldots, c_D$ encodes the path for word $w_i$
in the clustering tree, with $D$ being the depth of the
tree, $c_d(w_i)$ a class or sub-class assigned to $w_i$, and
$c_D(w_i)$ being the leaf associated with $w_i$ (the word
itself). The probability of $w_i$ given its history $h$ can
then be computed as:


$$P(w_i \mid h) = P(c_1(w_i) \mid h) \times \prod_{d=2}^{D} P(c_d(w_i) \mid h, c_{1:d-1}) \qquad (7)$$


There is a softmax function at each level of the tree 
and each word ends up forming its own class (a leaf). 
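
For a tree of depth D = 2 (one class layer, then one word layer per class), equation (7) reduces to a product of two softmax outputs. The sketch below illustrates this special case; the class assignment and the logit values are invented, and in the actual model the logits would be computed from the hidden layer rather than supplied by hand.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def structured_word_probability(word, class_of, class_logits, word_logits):
    """P(w | h) = P(c_1(w) | h) * P(w | h, c_1(w)) for a depth-2 tree."""
    c = class_of[word]
    p_class = softmax(class_logits)[c]
    # The word-level softmax only ranges over the members of class c.
    members = sorted(w for w, k in class_of.items() if k == c)
    p_members = softmax([word_logits[w] for w in members])
    return p_class * p_members[members.index(word)]

# Hypothetical clustering: "the" forms a class of its own, like a short-list word.
class_of = {"the": 0, "cat": 1, "dog": 1, "paix": 2, "peace": 2, "prize": 2}
class_logits = [1.0, 0.2, -0.3]
word_logits = {"the": 0.0, "cat": 0.5, "dog": -0.5,
               "paix": 0.1, "peace": 0.9, "prize": -0.2}
total = sum(structured_word_probability(w, class_of, class_logits, word_logits)
            for w in class_of)
```

Each prediction then involves one softmax over the classes and one over the members of a single class, instead of one softmax over the entire vocabulary, while the word probabilities still sum to one.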


The SOUL model, shown in Figure 2, is thus
identical to the standard model up to the output
layer. The main difference lies in the output structure,
which involves several layers with a softmax activation
function. The first (class layer) estimates
the class probability $P(c_1(w_i) \mid h)$, while other output
sub-class layers estimate the sub-class probabilities
$P(c_d(w_i) \mid h, c_{1:d-1})$. Finally, the word layers estimate
the word probabilities $P(c_D(w_i) \mid h, c_{1:D-1})$.
Words in the short-list remain special, since each of
them represents a (final) class.


Training a SOUL model can be achieved by maxi- 
mizing the log-likelihood of the parameters on some 
training corpus. Following (Bengio et al., 2003), 
this optimization is performed by stochastic back- 
propagation. Details of the training procedure can 
be found in (Le et al., 2011b). 


Neural network architectures are also interesting 
as they can easily handle larger contexts than typical 
n-gram models. In the SOUL architecture, enlarging 
the context mainly consists in increasing the size of 
the projection layer, which corresponds to a simple 
look-up operation. Increasing the context length at 
the input layer thus only causes a linear growth in 
complexity in the worst case (Schwenk, 2007).


[Figure 2 diagram; recoverable labels: short-list, sub-class layers]


Figure 2: The architecture of a SOUL Neural Network 
language model in the case of a 4-gram model. 


3.3 Translation modeling with SOUL 


The SOUL architecture was used successfully to
deliver (monolingual) LM probabilities for speech
recognition (Le et al., 2011a) and machine translation
(Allauzen et al., 2011) applications. In fact,
using this architecture, it is possible to estimate n- 
gram distributions for any kind of discrete random 
variables, such as a phrase or a tuple. The SOUL ar- 
chitecture can thus be readily used as a replacement 
for the standard n-gram TM described in section 2.1. 
This is because all the random variables are events 
over the same set of tuples. 


Adopting this architecture for the other n-gram
TMs, described respectively by equations (3) and (5),
is more tricky, as they involve two different lan- 
guages and thus two different vocabularies: the pre- 
dicted unit is a target phrase (resp. word), whereas 
the context is made of both source and target phrases 
(resp. words). A subsequent modification of the 
SOUL architecture was thus performed to make up 
for “mixed” contexts: rather than projecting all the 
context words or phrases into the same continuous 
space (using the matrix R, see Figure 2), we used 
two different projection matrices, one for each lan- 
guage. The input layer is thus composed of two vec- 
tors in two different spaces; these two representa- 
tions are then combined through the hidden layer, 
the other layers remaining unchanged. 
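
The "mixed" input layer can be sketched as follows. The matrix names R_src and R_trg, the dimensions and the tanh hidden layer are assumptions chosen for the example; the only point illustrated is that source and target context words are looked up in two separate projection matrices before being combined by the hidden layer.

```python
import math
import random

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def mixed_context_hidden(src_ids, trg_ids, R_src, R_trg, W_h):
    # Source context words are projected with R_src, target ones with R_trg.
    x = [value for i in src_ids for value in R_src[i]]
    x += [value for j in trg_ids for value in R_trg[j]]
    # The two representations are combined by the hidden layer (tanh assumed).
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_h]
    return x, hidden

rng = random.Random(0)
d, H = 3, 5                       # toy embedding and hidden sizes
R_src = random_matrix(7, d, rng)  # 7 hypothetical source words
R_trg = random_matrix(6, d, rng)  # 6 hypothetical target words
src_ctx, trg_ctx = [2, 4], [1, 3]
W_h = random_matrix(H, (len(src_ctx) + len(trg_ctx)) * d, rng)
x, hidden = mixed_context_hidden(src_ctx, trg_ctx, R_src, R_trg, W_h)
```

Everything above the hidden layer, including the structured output, can then be left as in the monolingual model.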


4 Experimental Results 


We now turn to an experimental comparison of the 
models introduced in Section 2. We first describe 
the tasks and data that were used, before presenting 
our n-gram based system and baseline set-up. Our 
results are finally presented and discussed. 

Let us first emphasize that the design and inte- 
gration of a SOUL model for large SMT tasks is 
far from easy, given the computational cost of com- 
puting n-gram probabilities, a task that is performed 
repeatedly during the search of the best translation. 
Our solution was to resort to a two pass approach: 
the first pass uses a conventional back-off n-gram 
model to produce a k-best list (the k most likely 
translations); in the second pass, the probability of 
an m-gram SOUL model is computed for each hypothesis,
added as a new feature, and the k-best list
is reordered accordingly³. In all the following experiments,
we used a fixed context size for SOUL of
m = 10, and used k = 300.
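
The second pass can be summarized by the following Python sketch. The hypothesis strings, the scores and the feature weight are invented for illustration, and the SOUL model is replaced by a simple lookup table; in practice the weight of the new feature would be retuned together with the other coefficients.

```python
def rescore_kbest(kbest, soul_logprob, soul_weight):
    """Add a weighted SOUL log-probability to each hypothesis score and re-sort.

    kbest: list of (hypothesis, first-pass score) pairs.
    soul_logprob: function returning the SOUL log-probability of a hypothesis
                  (a stand-in for the actual model, assumed here).
    soul_weight: coefficient of the new feature (assumed to be already tuned).
    """
    rescored = [(hyp, score + soul_weight * soul_logprob(hyp))
                for hyp, score in kbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Hypothetical 3-best list and SOUL scores.
kbest = [("hyp a", -10.0), ("hyp b", -10.5), ("hyp c", -12.0)]
soul_scores = {"hyp a": -8.0, "hyp b": -5.0, "hyp c": -6.0}
reranked = rescore_kbest(kbest, soul_scores.get, 0.5)
order = [hyp for hyp, _ in reranked]
```

Because the SOUL model is only queried for the k hypotheses of each sentence, its cost stays bounded regardless of the size of the first-pass search space.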


4.1 Tasks and corpora 


The two tasks considered in our experiments 
are adapted from the text translation track of 
IWSLT 2011 from English to French (the "TED"
talk task): a small data scenario where the only 
training data is a small in-domain corpus; and a large 
scale condition using all the available training data. 
In this article, we only provide a short overview of 
the task; all the necessary details regarding this evaluation
campaign are on the official website⁴.

The in-domain training data consists of 107,058
sentence pairs, whereas for the large scale task, all
the data available for the WMT 2011 evaluation⁵ are
added. For the latter task, the available parallel data
includes a large Web corpus, referred to as the GigaWord
parallel corpus. This corpus is very noisy
and is accordingly filtered using a simple perplexity
criterion as explained in (Allauzen et al., 2011). The
total amount of training data is approximately 11.5
million sentence pairs for the bilingual part, and
about 2.5 billion words for the monolingual part.
As the provided development data was quite small,
development and test set were inverted, and we finally
used a development set of 1,664 sentences, and
a test set of 934 sentences. Table 1 provides the
sizes of the different vocabularies. The n-gram TMs
are estimated over a training corpus composed of tuple
sequences. Tuples are extracted from the word-aligned
parallel data (using MGIZA++⁶ with default
settings) in such a way that a unique segmentation
of the bilingual corpus is achieved, which allows
bilingual n-gram models to be estimated directly (see (Crego
and Mariño, 2006) for details).

³ The probability estimated with the SOUL model is added as
a new feature to the score of a hypothesis given by Equation 1.
The coefficients are retuned before the reranking step.

⁴ iwslt2011.org

⁵ www.statmt.org/wmt11/


Model             Vocabulary size
                  Small task        Large task
                  src      trg      src      trg
Standard              317k              8847k
Phrase factored   96k      131k     4262k    3972k
Word factored     45k      53k      505k     492k


Table 1: Vocabulary sizes for the English to French tasks
obtained with various SOUL translation (TM) models.
For the factored models, sizes are indicated for both
source (src) and target (trg) sides.


4.2 n-gram based translation system 


The n-gram based system used here is based on an
open source implementation described in (Crego et
al., 2011). In a nutshell, the TM is implemented as
a stochastic finite-state transducer trained using an
n-gram model of (source, target) pairs as described in
section 2.1. Training this model requires reordering
source sentences so as to match the target word order.
This is performed by a non-deterministic finite-state
reordering model, which uses part-of-speech
information generated by the TreeTagger to generalize
reordering patterns beyond lexical regularities.

In addition to the TM, fourteen feature functions
are included: a target-language model; four lexicon
models; six lexicalized reordering models (Tillmann,
2004; Crego et al., 2011); a distance-based
distortion model; and finally a word-bonus model
and a tuple-bonus model. The four lexicon models
are similar to the ones used in standard phrase-based
systems: two scores correspond to the relative
frequencies of the tuples and two lexical weights
are estimated from the automatically generated word
alignments. The weights associated with the feature
functions are optimized using Minimum
Error Rate Training (MERT) (Och, 2003). All the
results in BLEU are obtained as an average of 4
optimization runs⁷.

⁶ geek.kyloo.net/software

For the small task, the target LM is a standard 
4-gram model estimated with the Kneser-Ney dis- 
counting scheme interpolated with lower order mod- 
els (Kneser and Ney, 1995; Chen and Goodman, 
1996), while for the large task, the target LM is ob- 
tained by linear interpolation of several 4-gram mod- 
els (see (Lavergne et al., 2011) for details). As for 
the TM, all the available parallel corpora were sim- 
ply pooled together to train a 3-gram model. Results 
obtained with this large-scale system were found to 
be comparable to some of the best official submis- 
sions. 


4.3 Small task evaluation 


Table 2 summarizes the results obtained with the
baseline and different SOUL models, TMs and a
target LM. The first comparison concerns the standard
n-gram TM, defined by equation (2), when
estimated conventionally or as a SOUL model. Adding
the latter model yields a slight BLEU improvement
of 0.5 point over the baseline. When the SOUL TM
is phrase factored as defined in equation (3), the
gain is 0.9 BLEU point instead. This difference
can be explained by the smaller vocabularies used
in the latter model, and its improved robustness to
data sparsity issues. Additional gains are obtained
with the word factored TM defined by equation (5):
a BLEU improvement of 0.8 point over the phrase
factored TM and of 1.7 points over the baseline are
respectively achieved. We assume that the observed
improvements can be explained by the joint effect of
a better smoothing and a longer context.

The comparison with the condition where we only
use a SOUL target LM is interesting as well. Here,
the use of the word factored TM still yields a 0.6
BLEU improvement. This result shows that there
is an actual benefit in smoothing the TM estimates,
rather than focusing only on the LM estimates.

⁷ The standard deviations are below 0.1 and thus omitted in
the reported results.


Model                  BLEU
                       dev     test
Baseline               31.4    25.8
Adding a SOUL model
  Standard TM          32.0    26.3
  Phrase factored TM   32.7    26.7
  Word factored TM     33.6    27.5
  Target LM            32.6    26.9


Table 2: Results for the small English to French task obtained
with the baseline system and with various SOUL
translation (TM) or target language (LM) models.


Model                                                     BLEU
                                                          dev     test
Adding a SOUL model
  + $P(t_i^k \mid h^{n-1}(t_i^k), h^{n-1}(s_{i+1}^1))$    32.6    26.9
  + $P(s_i^k \mid h^{n-1}(t_i^1), h^{n-1}(s_i^k))$        32.1    26.2
  + the combination of both                               33.2    27.5

  + $P(s_i^k \mid h^{n-1}(s_i^k), h^{n-1}(t_{i+1}^1))$    31.7    26.1
  + $P(t_i^k \mid h^{n-1}(s_i^1), h^{n-1}(t_i^k))$        32.7    26.8
  + the combination of both                               33.4    27.2


Table 3: Comparison of the different components and
variations of the word factored translation model.


Table 3 reports a comparison among the different
components and variations of the word
factored TM. In the upper part of the table,
the model defined by equation (5) is evaluated
component by component: first the translation
term $P(t_i^k \mid h^{n-1}(t_i^k), h^{n-1}(s_{i+1}^1))$, then its
distortion counterpart $P(s_i^k \mid h^{n-1}(t_i^1), h^{n-1}(s_i^k))$ and
finally their combination, which yields the joint
probability of the sentence pair. Here, we observe that
the best improvement is obtained with the translation
term, which is 0.7 BLEU point better than the
distortion term. Moreover, the use of just a translation
term only yields a BLEU score equal to the one
obtained with the SOUL target LM, and its combination
with the distortion term is decisive to attain the
additional gain of 0.6 BLEU point. The lower part of
the table provides the same comparison, but for the
variation of the word factored TM. Besides a similar
trend, we observe that this variation delivers slightly
lower results. This can be explained by the restricted
context used by the translation term which no longer
includes the current source phrase or word.


Model                              BLEU
                                   dev     test
Baseline                           33.7    28.2
Adding a word factored SOUL TM
  + in-domain TM                   35.2    29.4
  + out-of-domain TM               34.8    29.1
  + out-of-domain adapted TM       35.5    29.8
Adding a SOUL LM
  + out-of-domain adapted LM       35.0    29.2


Table 4: Results for the large English to French trans- 
lation task obtained by adding various SOUL translation 
and language models (see text for details). 


4.4 Large task evaluation 


For the large-scale setting, the training material in- 
creases drastically with the use of the additional out- 
of-domain data for the baseline models. Results are 
summarized in Table 4. The first observation is the 
large increase of BLEU (+2.4 points) for the base- 
line system over the small-scale baseline. For this 
task, only the word factored TM is evaluated since 
it significantly outperforms the others on the small 
task (see section 4.3). 

In a first scenario, we use a word factored TM, 
trained only on the small in-domain corpus. Even 
though the training corpus of the baseline TM is one 
hundred times larger than this small in-domain data, 
adding the SOUL TM still yields a BLEU increase 
of 1.2 points⁸. In a second scenario, we increase the
training corpus for the SOUL model, and include parts of
the out-of-domain data (the WMT part). The resulting
BLEU score is here slightly worse than the one
obtained with just the in-domain TM, yet it still
improves on the baseline.

In a last attempt, we amended the training regime
of the neural network. In a first step, we conventionally
trained a SOUL model using the same out-of-domain
parallel data as before. We then adapted this
model by running five additional epochs of the
back-propagation algorithm using the in-domain data.
Using this adapted model yielded our best results to
date with a BLEU improvement of 1.6 points over
the baseline results. Moreover, the gains obtained
using this simple domain adaptation strategy are
respectively +0.4 and +0.8 BLEU, as compared
with the small in-domain model and the large
out-of-domain model. These results show that the SOUL
TM can scale efficiently and that its structure is well
suited for domain adaptation.

⁸ Note that the in-domain data was already included in the
training corpus of the baseline TM.


5 Related work 


To the best of our knowledge, the first work on machine
translation in continuous spaces is (Schwenk
et al., 2007), where the authors introduced the model
referred to here as the standard n-gram translation
model in Section 2.1. This model is an extension
of the continuous space language model
of (Bengio et al., 2003), in which the basic unit is the tuple
(or equivalently the phrase pair). The resulting vocabulary
being too large to be handled by neural networks
without a structured output layer, the authors
had to restrict the set of predicted units to an
8k short-list. Moreover, in (Zamora-Martinez et al.,
2010), the authors propose a tighter integration of a
continuous space model with an n-gram approach, but
only for the target LM.


A different approach, described in (Sarikaya et 
al., 2008), divides the problem in two parts: first the 
continuous representation is obtained by an adapta- 
tion of the Latent Semantic Analysis; then a Gaus- 
sian mixture model is learned using this continu- 
ous representation and included in a hidden Markov 
model. One problem with this approach is the sep- 
aration between the training of the continuous rep- 
resentation on the one hand, and the training of the 
translation model on the other hand. In comparison, 
in our approach, the representation and the prediction
are learned jointly.


Other ways to address the data sparsity issues
faced by translation models were also proposed in the
literature. Smoothing is obviously one possibility
(Foster et al., 2006). Another is to use factored language
models, introduced in (Bilmes and Kirchhoff,
2003), then adapted for translation models in (Koehn
and Hoang, 2007; Crego and Yvon, 2010). Such approaches
require the use of external linguistic analysis
tools, which are error prone; moreover, they did not
seem to bring clear improvements, even when translating
into morphologically rich languages.


6 Conclusion 


In this paper, we have presented possible ways to use 
a neural network architecture as a translation model. 
A first contribution was to produce the first large- 
scale neural translation model, implemented here in 
the framework of the n-gram based models, tak- 
ing advantage of a specific hierarchical architecture 
(SOUL). By considering several decompositions of 
the joint probability of a sentence pair, several bilin- 
gual translation models were presented and dis- 
cussed. As it turned out, using a factorization which 
clearly distinguishes the source and target sides, and 
only involves word probabilities, proved an effective
remedy to data sparsity issues and provided significant
improvements over the baseline. Moreover,
this approach was also tested within the systems
we submitted to the shared translation task of
the seventh workshop on statistical machine translation
(WMT 2012). These experiments, in a
large scale setup and for different language pairs,
corroborate the improvements reported in this article.

We also investigated various training regimes for 
these models in a cross domain adaptation setting. 
Our results show that adapting an out-of-domain 
SOUL TM is both an effective and very fast way to 
perform bilingual model adaptation. Adding up all 
these novelties finally brought us a 1.6 BLEU point 
improvement over the baseline. Even though our 
experiments were carried out only within the frame- 
work of n-gram based MT systems, using such mod- 
els in other systems is straightforward. Future work 
will thus aim at introducing them into conventional 
phrase-based systems, such as Moses (Koehn et al., 
2007). Given that Moses only implicitly uses n- 
gram based information, adding SOUL translation 
models is expected to be even more helpful.